On Separation of English Numerals from Multilingual Document Images

Authors

  • Basanna V. Dhandra
  • Mallikarjun Hangarge
Abstract

For optical character recognition (OCR) of bilingual or multilingual documents containing words in a regional language and numerals in English, the different scripts must be identified before a script-specific OCR engine is run on each of them. This paper attempts to separate English numerals at word level from bilingual and trilingual documents in Kannada, Devnagari, Tamil, Odiya and Malayalam scripts, using discriminating features such as aspect ratio, stroke densities and eccentricity. The k-nearest neighbour algorithm is used to classify new word images, and the method is tested on 6000 sample words with five-fold cross-validation. The algorithm is robust with respect to font styles, sizes and noise. The results obtained are quite encouraging.
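
The pipeline named in the abstract (shape features per word image, k-nearest neighbour classification, five-fold cross-validation) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the binary list-of-rows image format, the use of ink density inside the bounding box as a stand-in for stroke density, the moment-based definition of eccentricity, and the default k = 3 are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_features(img):
    """Return (aspect_ratio, ink_density, eccentricity) for a binary word image.

    img is a list of rows of 0/1 values, 1 meaning ink (assumed format).
    """
    pts = [(x, y) for y, row in enumerate(img) for x, v in enumerate(row) if v]
    if not pts:
        raise ValueError("word image contains no ink pixels")
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    aspect_ratio = width / height
    # Assumption: ink density in the bounding box stands in for stroke density.
    density = len(pts) / (width * height)
    # Assumption: eccentricity from the second central moments of the ink pixels.
    n = len(pts)
    mx, my = sum(xs) / n, sum(ys) / n
    mu20 = sum((x - mx) ** 2 for x in xs) / n
    mu02 = sum((y - my) ** 2 for y in ys) / n
    mu11 = sum((x - mx) * (y - my) for x, y in pts) / n
    spread = math.sqrt((mu20 - mu02) ** 2 + 4 * mu11 ** 2)
    lam1 = (mu20 + mu02 + spread) / 2            # variance along the major axis
    lam2 = max((mu20 + mu02 - spread) / 2, 0.0)  # variance along the minor axis
    eccentricity = math.sqrt(1 - lam2 / lam1) if lam1 > 0 else 0.0
    return (aspect_ratio, density, eccentricity)


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to query.

    train is a list of (feature_tuple, label) pairs.
    """
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_validate(samples, folds=5, k=3):
    """Accuracy of k-NN under interleaved n-fold cross-validation."""
    correct = 0
    for f in range(folds):
        test = samples[f::folds]  # every folds-th sample is held out
        train = [s for i, s in enumerate(samples) if i % folds != f]
        correct += sum(knn_predict(train, feats, k) == label
                       for feats, label in test)
    return correct / len(samples)
```

In practice the features would likely need rescaling to comparable ranges before Euclidean distances are taken, since aspect ratio can vary far more than density or eccentricity; the sketch omits that step.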

Similar resources

Identification of Printed Punjabi Words and English Numerals Using Gabor Features

Script identification is one of the challenging steps in the development of optical character recognition system for bilingual or multilingual documents. In this paper an attempt is made for identification of English numerals at word level from Punjabi documents by using Gabor features. The support vector machine (SVM) classifier with five fold cross validation is used to classify the word imag...

Script Identification from Printed Document Images Using Statistical Features

Automatic identification of a script in a document image facilitates many important applications such as automatic archiving of multilingual documents; searching online archives of document images and for the selection of script specific OCR in a multilingual environment. In this work a technique for script identification from document images is proposed. The method uses vertical and horizontal...

Monothetic Separation of Telugu, Hindi and English Text Lines From a Multilingual

In a multi-script multi-lingual environment, a document may contain text lines in more than one script/language forms. It is necessary to identify different script regions of the document in order to feed the document to the OCRs of individual language. With this context, this paper proposes to develop a monothetic algorithmic model to identify and separate text lines Telugu, Hindi and English ...

Removing Geometric Distortion from Text Using Geometric Information of Text Lines

Document images produced by scanners or digital cameras usually have photometric and geometric distortions. If either of these effects distorts document, recognition of words from such a document image using OCR is subject to errors. In this paper we propose a novel approach to significantly remove geometric distortion from document images. In this method first we extract document lines from do...

Journal:
  • Journal of Multimedia

Volume 2  Issue 

Pages  -

Publication date: 2007